How to convert Windows end of line to Unix end of line (CR/LF to LF)
I’m a Java developer and I’m using Ubuntu to develop. The project was created in Windows with Eclipse and it’s using the Windows-1252 encoding. To convert to UTF-8 I’ve used the recode program:
find Web -iname \*.java | xargs recode CP1252..UTF-8
recode: Web/src/br/cits/projeto/geral/presentation/GravacaoMessageHelper.java failed: Ambiguous output in step `CR-LF..data'
Convert line endings from CR/LF to a single LF: edit the file with Vim, give the command :set ff=unix and save the file. recode should now run without errors.
Nice, but I have many files to remove the CR/LF characters from, and I can't open each one to do it. Vim doesn't seem to provide a command-line option for doing this from Bash. Can sed be used to do this? How?
recode produces this error when trying to recode a file with mixed DOS (\r\n, CRLF) and Unix (\n, LF) newline coding. Unfortunately fromdos, formerly a binary, is currently an alias for recode, which has this problem.
8 Answers
There should be a program called dos2unix that will fix line endings for you. If it’s not already on your Linux box, it should be available via the package manager.
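For example, on a Debian/Ubuntu system it might look like this (the file name is taken from the question):

    sudo apt-get install dos2unix
    dos2unix GravacaoMessageHelper.java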
I've installed tofrodos, which provides the fromdos command, but the problem persists. fromdos -a GravacaoMessageHelper.java; recode CP1252..UTF-8 GravacaoMessageHelper.java returns: recode: GravacaoMessageHelper.java failed: Ambiguous output in step `CR-LF..data'
@MaikoID: Then you have bigger problems. recode shouldn’t care about line endings anyway, as a CR is just another character to convert. And it doesn’t seem to care on my machine.
fromdos is just an alias to recode , and that will produce the error OP mentioned on files with mixed dos (\r\n — CRLF) and unix (\n LF) coding. Only dos2unix works universally.
Just to follow up on this, I ran into the same problem and ended up using the following: find ./ -name "*.java" -exec dos2unix {} +
sed cannot match \n because the trailing newline is removed before the line is put into the pattern space, but it can match \r , so you can convert \r\n (DOS) to \n (Unix) by removing \r:
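A typical invocation (the file name here is just a placeholder) would be:

    sed -i 's/\r$//' GravacaoMessageHelper.java    # delete the trailing CR from every line, editing the file in place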
Warning: this will change the original file
However, you cannot change from Unix EOL to DOS or old Mac (\r) line endings this way.
+1 This is a nice solution! But you should note that sed -i will change the original file! People wouldn't expect sed to behave that way, so the warning is appropriate here. Not many people know about -i, so they will try sed -i file > file2 and not expect the original file to be modified.
Not all sed variants recognize the nonstandard symbolic sequence \r . Try with a literal ctrl-M character in that case (in many shells, type ctrl-V ctrl-M to produce the literal control character).
Is this safe to use on Linux files as well? In other words, if you are unsure, can you just run it without checking first?
Actually, Vim does allow what you’re looking for. Enter Vim, and type the following commands:
    :args **/*.java
    :argdo set ff=unix | update | next
The first of these commands sets the argument list to every file matching **/*.java , which is all Java files, recursively. The second of these commands does the following to each file in the argument list, in turn:
- Sets the line-endings to Unix style (you already know this)
- Writes the file out iff it’s been changed
- Proceeds to the next file
This is probably much slower than using dos2unix in a for-loop, but it’s still nice to know how to do it in Vim!
I’ll take a little exception to jichao’s answer. You can actually do everything he just talked about fairly easily. Instead of looking for a \n , just look for carriage return at the end of the line.
To change from Unix back to DOS, simply look for the last character on the line and add a carriage return after it. (I'll add -r to make this easier by enabling extended regular expressions.)
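A sketch of that command, with a placeholder file name, might be:

    sed -ri 's/(.)$/\1\r/' file.txt    # append a CR after the last character of each non-empty line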
Theoretically, the file could be changed to Mac style by adding code to the last example that also appends the next line of input to the first line until all lines have been processed. I won’t try to make that example here, though.
Warning: -i changes the actual file. If you want a backup to be made, add a string of characters after -i . This will move the existing file to a file with the same name with your characters added to the end.
Update: The Unix to DOS conversion can be simplified and made more efficient by not bothering to look for the last character. This also means -r is no longer needed for it to work:
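With a placeholder file name, the simplified command reads:

    sed -i 's/$/\r/' file.txt    # append a CR to the end of every line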
I like your suggestion, but it's just missing a closing single quote. It should be: sed -ri 's/(.)$/\1\r/' $
For converting LF to CRLF, capturing the last character before the end of line isn't required and may have a performance impact as well. In my case it is sufficient to do sed -i 's/$/\r/' $
@ThomasUrban Thank you for that info. I’ve added an update with the simplification to allow people to see it sooner. I’m leaving the original expression, though, so that people who read your comment don’t get confused reading your statement.
The tr command, which should already be available to you, can also do this.
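A standalone invocation might look like the following (the file names are placeholders; \15 is the CR character and \32 is the DOS end-of-file marker):

    tr -d '\15\32' < windows_file.java > unix_file.java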
You’ll need to run tr from within a script, since it cannot work with file names. For example, create a file myscript.sh:
    #!/bin/bash
    for f in $(find . -iname '*.java'); do
        echo "$f"
        tr -d '\15\32' < "$f" > "$f.tr"    # strip CR and DOS EOF characters
        mv "$f.tr" "$f"
        recode CP1252..UTF-8 "$f"
    done
Running myscript.sh would process all the java files in the current directory and its subdirectories.
You would need to run tr within a bash script, since it can’t work on file names. I’ll edit my answer with a sample script.
Ambiguous output in step `CR-LF..data'
The simple solution might be to add the -f flag to force the conversion.
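With the command from the question, that would be along the lines of (recode's -f/--force proceeds even when the recoding is not reversible):

    recode -f CP1252..UTF-8 GravacaoMessageHelper.java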
Try the Python script by Bryan Maupin found here (I’ve modified it a little bit to be more generic):
    #!/usr/bin/env python
    import sys

    input_file_name = sys.argv[1]
    output_file_name = sys.argv[2]

    input_file = open(input_file_name)
    output_file = open(output_file_name, 'w')

    line_number = 0
    for input_line in input_file:
        line_number += 1
        try:  # first try to decode it using cp1252 (Windows, Western Europe)
            output_line = input_line.decode('cp1252').encode('utf8')
        except UnicodeDecodeError, error:  # if there's an error
            sys.stderr.write('ERROR (line %s):\t%s\n' % (line_number, error))  # write to stderr
            try:  # then if that fails, try to decode using latin1 (ISO 8859-1)
                output_line = input_line.decode('latin1').encode('utf8')
            except UnicodeDecodeError, error:  # if there's an error
                sys.stderr.write('ERROR (line %s):\t%s\n' % (line_number, error))  # write to stderr
                sys.exit(1)  # and just keep going
        output_file.write(output_line)

    input_file.close()
    output_file.close()
You can use that script with
$ ./cp1252_utf8.py file_cp1252.sql file_utf8.sql
Convert line endings [duplicate]
I have been using d2u to convert line endings. After installing Puppy Linux I noticed that it does not come with d2u , but dos2unix . Then I noticed that Ubuntu is missing both by default. What is another way to convert line endings?
2 Answers
Using tr
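The typical form (file names are placeholders) is:

    tr -d '\r' < windows.txt > unix.txt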
Using perl
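One common form (file names again placeholders) is:

    perl -pe 's/\r$//' windows.txt > unix.txt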
Using sed
sed 's/^M$//' windows.txt > unix.txt
sed 's/\r$//' windows.txt > unix.txt
To obtain ^M , you have to type CTRL-V and then CTRL-M .
I learned that on Mac you cannot use tr to read from and write to the same file. That results in a blank file, but writing to a different name works great!
@Loren I think that should be your assumption with any redirection. The destination file is opened before the source is read. Some commands let you do "in-place" editing, like sed's -i, but they use intermediate/backup files anyway.
These answers are generally correct, and you could add awk '{sub(/\r$/,"")}1' windows.txt > unix.txt, but be aware that the tr is deleting all \r characters from the input, not just those that occur at the end of each line as the perl, sed, and now awk scripts would do.
Redirecting stdout and stdin in the same line kinda messes with my head, so I used: cat windows_newlines.txt | tr -d '\r' > unix_newlines.txt
Doing this with POSIX is tricky:
- POSIX Sed does not support \r or \15. Even if it did, the in-place option -i is not POSIX.
- POSIX Awk does support \r and \15; however, the -i inplace option is not POSIX.
- d2u and dos2unix are not POSIX utilities, but ex is.
- POSIX ex does not support \r, \15, \n or \12.
To remove carriage returns:
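The answer's original awk-based commands are not reproduced here; as a sketch, one way to stay within POSIX and still edit the file in place is to let ex filter the buffer through tr (the file name is a placeholder):

    printf '%s\n' '%!tr -d "\r"' x | ex -s file.txt

Here %! filters every line through tr -d "\r" (POSIX tr does understand \r), and x writes the result back and exits.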
Those awk scripts are GNU awk only due to multi-character RS (more than 1 char in a RS invokes undefined behavior in POSIX so some POSIX awks will silently drop the $ and retain just the ^ , others can do whatever else they like), they would produce unexpected results when the getline fails, they will only operate on the first line of the input, and they will corrupt the input file in some situations and if they were fixed to operate on all lines would cause an infinite loop in others by writing to the input file as it’s being read. Do not execute those scripts.
This works on the same file, i.e. it replaces line endings in place, while the tr solutions require a different file as the output.
How to change line-ending settings
I actually find that the 3rd option works better. Otherwise I have often been in situations where I edit both batch and sh scripts on the same platform (Windows/Linux) and then commit them, and Git automatically "fixes" line endings for one platform. No, I prefer to be conscious of line endings and commit/check them out exactly as they are.
Agree with @JustAMartin: having the system mess with your line endings is a great way to introduce bugs that will take an entire day to track down and fix, as I just have. All decent editors and IDEs on Windows fully support LF line endings nowadays, so there is no need for this translation.
@Neutrino I wish this was true, but one example of IDE that messes with your line endings (and doesn’t offer a reasonable configuration option to turn this off) is Visual Studio.
8 Answers
The normal way to control this is with git config
git config --global core.autocrlf true
For details, scroll down in this link to Pro Git to the section named "core.autocrlf".
If you want to know what file this is saved in, you can run the command:
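The command in question is presumably the built-in config editor:

    git config --global --edit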
and the git global config file should open in a text editor, and you can see where that file was loaded from.
Actually, if you re-read your own question, in the copy/pasted excerpts: 1. ("core.autocrlf" is set to "true"), 2. ("core.autocrlf" is set to "input"), 3. ("core.autocrlf" is set to "false"). So you basically answered your own question? 🙂
Pro Git has been absorbed by git-scm. An equivalent link to the git-scm book is the formatting and whitespace configuration section.
Line ending format used in each OS:
- Windows: CR/LF (\r\n)
- Linux/OSX: LF (\n)
We can configure git to auto-correct line ending formats for each OS in two ways.
Global Configuration
In Linux/OSX
git config --global core.autocrlf input
This will fix any CRLF to LF when you commit.
In Windows
git config --global core.autocrlf true
This will make sure that, when you check out on Windows, all LF are converted to CRLF.
.gitattributes File
It is a good idea to keep a .gitattributes file, as we don't want to expect everyone on our team to set their own config. This file should be placed in the repository root; if it exists, git will respect it.
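The basic rule being described next is presumably the auto-detection setting:

    * text=auto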
This will treat all files as text files and convert to the OS's line ending on checkout and back to LF on commit automatically. If you want to specify the line ending explicitly, you can use:
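The explicit forms are presumably the eol attributes, something along the lines of:

    * text eol=crlf
    * text eol=lf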
The first one is for checkout and the second one is for commit.
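The rule for images described next is presumably:

    *.jpg binary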
This will treat all .jpg images as binary files, regardless of path. So no conversion needed.
Or you can add path qualifiers:
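For example (the path here is purely illustrative):

    my_images/**/*.jpg binary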
According to the gitattributes documentation, setting * text=auto lets git decide whether the content is text or not. Forcing all files to be text should be * text only.
The first one is for checkout and the second one is for commit. Are you sure? I see no documentation supporting this
For a repository setting solution that can be redistributed to all developers, check out the text attribute in the .gitattributes file. This way, developers don't have to manually set their own line endings on the repository, and because different repositories can have different line ending styles, a global core.autocrlf is not the best, at least in my opinion.
For example, unsetting this attribute on a given path [. -text] will force git not to touch line endings when checking in and checking out. In my opinion, this is the best behavior, as most modern text editors can handle both types of line endings. Also, if you as a developer still want to do line ending conversion when checking in, you can still set the path to match certain files or set the eol attribute (in .gitattributes) in your repository.
Also check out this related post, which describes .gitattributes file and text attribute in more detail: What’s the best CRLF (carriage return, line feed) handling strategy with Git?