One of the repositories I maintain is a beginner’s GitHub repo. New developers can make their first pull request by adding their GitHub handle to a simple text file.
When pull requests get merged into the master branch, they often contain duplicates. The file has more than 7,000 lines. Names are not sorted alphabetically.
I needed a simple way to remove all duplicate lines from the file without sorting the lines.
I’m using awk, a Unix shell program. I’m not proficient in using awk, but I’ve found useful one-liners that do what I want.
For reference, this is what my file should look like:
# CONTRIBUTORS
- [@RupamG](https://github.com/RupamG)
- [@hariharen9](https://github.com/hariharen9)
- [@clevermiraz](https://github.com/clevermiraz)
- [@smeubank](https://github.com/smeubank)
- [@LJones95](https://github.com/LJones95)
- [@shannon-nz](https://github.com/shannon-nz)
- [@sammiepls](https://github.com/sammiepls)
Here’s what it often looks like:
# CONTRIBUTORS
- [@RupamG](https://github.com/RupamG)
- [@hariharen9](https://github.com/hariharen9)
- [@clevermiraz](https://github.com/clevermiraz)
- [@smeubank](https://github.com/smeubank)
- [@LJones95](https://github.com/LJones95)
- [@hariharen9](https://github.com/hariharen9)
- [@shannon-nz](https://github.com/shannon-nz)
- [@sammiepls](https://github.com/sammiepls)
- [@shannon-nz](https://github.com/shannon-nz)
1. Remove all empty lines
awk 'NF > 0' file.txt
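NF is awk's built-in number-of-fields variable. An empty or whitespace-only line has no fields, so the condition NF > 0 is true only for lines with content, and awk prints just those.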
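To check this on something smaller than the real file, here is a minimal sketch. The temporary file and its contents are made up for illustration; only the awk command comes from the step above.

```shell
# Throwaway sample: a heading, an empty line, an entry,
# a whitespace-only line, and another entry (contents are illustrative).
sample=$(mktemp)
printf '%s\n' \
  '# CONTRIBUTORS' \
  '' \
  '- [@RupamG](https://github.com/RupamG)' \
  '   ' \
  '- [@hariharen9](https://github.com/hariharen9)' > "$sample"

# NF is 0 for empty and whitespace-only lines, so both are dropped.
no_blanks=$(awk 'NF > 0' "$sample")
printf '%s\n' "$no_blanks"

rm -f "$sample"
```

With the default field separator, the line that contains only spaces counts as blank too, which is what I want here.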
2. Remove duplicates
awk '!seen[$0]++' file.txt
I stole this command from opensource.com, where you can find an explanation of how it works.
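As far as I understand it, seen is an array keyed by the whole line ($0), and the line is printed only while its counter is still zero. The sketch below compares the one-liner with a longer version of what I believe is the same logic. The sample lines and variable names are illustrative, not part of the original command.

```shell
# Throwaway sample with two repeated lines (contents are illustrative).
sample=$(mktemp)
printf '%s\n' \
  '- [@RupamG]' \
  '- [@hariharen9]' \
  '- [@RupamG]' \
  '- [@smeubank]' \
  '- [@hariharen9]' > "$sample"

# The one-liner: the pattern is true only the first time a line appears.
short_form=$(awk '!seen[$0]++' "$sample")

# The same idea written out: print while the counter is zero, then increment it.
long_form=$(awk '{ if (seen[$0] == 0) print; seen[$0]++ }' "$sample")

printf '%s\n' "$short_form"
rm -f "$sample"
```

Both versions keep the first occurrence of each line and preserve the original order, so no sorting is needed.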
3. Add empty lines again
awk '{print; print "";}' file.txt
See Stack Exchange.
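The three steps can also be chained. This is a sketch of one way to do it, not necessarily how I run it every time: awk does not edit files in place, so the result goes to a temporary file that then replaces the original. The working directory and the file.tmp name are assumptions for the example.

```shell
# Work on a copy in a scratch directory (paths are illustrative).
work=$(mktemp -d)
printf '%s\n' \
  '# CONTRIBUTORS' \
  '' \
  '- [@RupamG](https://github.com/RupamG)' \
  '' \
  '- [@hariharen9](https://github.com/hariharen9)' \
  '' \
  '- [@RupamG](https://github.com/RupamG)' > "$work/file.txt"

# Steps 1 to 3 as one pipeline: drop blank lines, drop duplicates, re-add blank lines.
awk 'NF > 0' "$work/file.txt" | awk '!seen[$0]++' | awk '{print; print "";}' > "$work/file.tmp"

# A single awk call that should give the same result.
single_pass=$(awk 'NF > 0 && !seen[$0]++ { print; print "" }' "$work/file.txt")

# Replace the original only after the pipeline has finished writing.
mv "$work/file.tmp" "$work/file.txt"

cleaned=$(cat "$work/file.txt")
line_count=$(wc -l < "$work/file.txt" | tr -d ' ')
printf '%s\n' "$cleaned"

rm -rf "$work"
```

Redirecting straight back into file.txt would truncate the file before awk reads it, which is why the temporary file is there.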